1 Executive Summary

1.1 Goals and Introduction

You work for a large corporation that owns a collection of restaurants of different types. Currently it is evaluating the location, type, characteristics (cuisine, price point, design, marketing strategy) and positioning of a new restaurant in Edinburgh (if you know the local habits, you can give better insight). Your task is to analyse the dataset and give recommendations on strategy, based on the reviews, location (either neighbourhood or zipcode level) and competition. Also give estimates on volumes and revenues of the potential undertaking. You may have to check out what the price-range attributes signify by visiting the Yelp website.

1.2 Methodology

We decided to perform this analysis for Edinburgh. This was because we found extensive demographic data that could be easily integrated with the Yelp data. We also had better knowledge of the zip codes as Edinburgh is still in the UK. The Yelp data were provided as JSON files.

2 Review and data exploration -Analysis

This analysis points out demand and preference of customers from a large amount of reviews, with high dimensionality. These topics can provide meaningful insights to opening a new restaurant by considering what customers care about in order to increase the Yelp ratings, which directly affects the revenue. But how can restaurant understand the demands of its customers from a large amount of reviews? For a relatively small collection of reviews, it may be possible to manually inspect and classify the contents of reviews into specific categories based on similarity. But to partition large volumes of text, the process would be extremely time consuming. Topic modelling greatly reduces the time needed to perform the classification and understand the actual contents. We hope to use topic modelling to identify what users care about most when giving their rating stars, and ultimately determine what a new restaurant should be doing in order to receive high ratings.

In this study, we applied a non-negative matrix factorization (NMF) approach for the extraction and detection of concepts or topics from reviews. NMF introduces a technique that simultaneously perform dimension reduction and clustering that identifies semantic features in a document collection and groups the documents into clusters on the basis of shared semantic features [1]. The extracted topics from 1-star and 2-star reviews were used as an indicator of bad practice whereas extracted topics from 4-star and 5-star reviews were used as an indicator of good practice for operating a restaurant.

2.1 Description of datasets

Import the datasets.

business = pd.read_csv("edinburgh.csv",header=0)
checkin = pd.read_csv("edinCheckin.csv",header=0)
review = pd.read_csv("edinReview.csv",header=0)
tip = pd.read_csv("edinTip.csv",header=0)
user = pd.read_csv("edinUser.csv",header=0,usecols=range(0,23))

The main dataset in this study was ‘yelp_academic_dataset_review.json’ and summary of the dataset is shown below.

# Replace nan with blank space
review=review.replace(np.nan,' ', regex=True)

# Summary of dataset
review.info()

The text of reviews is full of punctuations, numbers and capital letters, further cleansing of data is required for text analysis.

review.text.head(10)

2.2 Data Cleansing

Clean up the text in review dataset.

# Referenced Regular Expression for email cleanup idea: 
def cleanup(text):

    # Make text lower case
    for f in re.findall("([A-Z]+)", text):
        text = text.replace(f, f.lower())
    
    # Remove escape symbols
    text = text.replace('\r', " ")
    text = text.replace('\n', " ")
    
    # Remove all non-ascii characters in the string
    text=unicode(text, 'ascii', 'ignore')

    # Creata a list of reg tools
    cleanuptools = [
    # Dates
    r"(monday|tuesday|wednesday|thursday|friday|saturday|sunday)",
    # Removing months
    r"january|february|march|april|may|june|july|august|september|october|november|december",
    # Punctuation and numbers to be removed
    r'[-|.|?|!|,|"|:|;|()|0-9]',
    ]
    
    for tool in cleanuptools:
        text = re.sub(tool," ", text)
    return text


# Constructing a list for stopwords 
stopwords = []

# Add scikit-learn's CountVectorizer's stop list to the created list
stopwords = sk.feature_extraction.text.ENGLISH_STOP_WORDS

# Apply the created functions to clean up text
review.text=review.text.apply(cleanup)

# Cleaned text
review.text[28]

Text after cleansing:

2.3 Inspection of review contents

2.3.1 Distribution of number of words

It is crucial to only include reviews with a considerable amount of contents as the topic modelling will not be able to extract any insight from a short review. The length of reviews was therefore analysed and a minimum requirement on the length of review was implemented.

f, ax = plt.subplots(figsize=(15,7.5))
n, bins, patches = ax.hist(review.text.apply(len),facecolor='black',bins=20)
ax.set_xticks(bins)
bin_centers = 0.5 * np.diff(bins) + bins[:-1]
for count, x in zip(n, bin_centers):
    percent = '{:.2f}%'.format((float(count) / n.sum())*100)
    ax.annotate(percent, xy=(x, 0), xycoords=('data', 'axes fraction'),
    xytext=(0, -32), textcoords='offset points', va='top', ha='center')
ax.set_xlabel('Count of reviews')
ax.set_ylabel('Count of words in review')

It can be seen that almost 95% of reviews have length of more than 500 words, therefore, it is not necessary to remove any short reviews. The distribution of ratings was also examined to avoid inbalanced datasets.

2.3.2 Reviews VS Stars

review.stars.value_counts(sort=False).plot(kind='bar',color="black")
plt.title('Reviews By Star');

Inituitively, we thought there would be more 1-star and 5-star reviews, however, the data shows otherwise, and the majority of reviews were 4-star and 5-star. It is also important to check the usefulness/reliability of reviews by examining their numbers of votes.

2.3.3 Reviews VS Votes

The majority of reviews have only one vote regardless of vote type (cool, useful or funny). So we decided not to remove reviews based on their numbers of votes (doing so would remove a significant amount of text from this analysis).

2.3.3.1 Vote for cool

pd.crosstab(review.stars,review.votes_cool[review.votes_cool!=0],margins=True)

2.3.3.2 Vote for funny

pd.crosstab(review.stars,review.votes_funny[review.votes_funny!=0],margins=True)

2.3.3.3 Vote for useful

pd.crosstab(review.stars,review.votes_useful[review.votes_useful!=0],margins=True)

2.4 Methodology and Procedure

2.4.1 Dividing dataset

Reviews were split into two groups, 1 and 2 star reviews were interpreted as bad reviews and 4 and 5-star reviews were interpreted as good reviews. 3-star reviews were excluded in this analysis due to the mixed expression of customers (three stars could be a good or bad review).

# Split the dataset into 2 categories: 1,2 stars and 4,5 stars

review_bad = review[(review.stars == 1) | (review.stars == 2)]
review_good = review[(review.stars == 4) | (review.stars == 5)]

review_good.text.head(10)

Example of good reviews:

2.4.2 Text feature extraction

A range of functions from the python scikit-learn package was used to extract features(words) from textual reviews content and create a document-term matrix, namely [1]-[3]:

  • tokenizing strings and documenting each possible token in a matrix, where white-space and punctuation were token separators.
  • counting the occurrences of tokens in each document.
  • normalizing and weighting with diminishing importance tokens which occur in the majority of samples / documents.

In a large text corpus, stopwords are common (e.g. “the”, “a”, “is”, “and” in English) and carry little meaningful information about the actual contents of the document. In order to reduce the influence of stop words appearing frequently across the entire corpus, TF-IDF term re-weighting functions normalized the data. TfidfVectorizer function combines the functions of TF-IDF and vectorization (Vectorization is a process combining tokenization, counting and normalization) allowing to build a document-term matrix for the corpus of documents:

#Vectorization
tfidfvectorizer_bad = TfidfVectorizer(max_features=15000, ngram_range=(1, 2), stop_words = stopwords,
                                 strip_accents="unicode", use_idf=True, norm="l2", min_df = 5)
tfidfvectorizer_good = TfidfVectorizer(max_features=15000, ngram_range=(1, 2), stop_words = stopwords,
                                 strip_accents="unicode", use_idf=True, norm="l2", min_df = 5)

#Create term document matrix for separate datasets
term_document_matrix_bad = tfidfvectorizer_bad.fit_transform(review_bad.text)
term_document_matrix_good = tfidfvectorizer_good.fit_transform(review_good.text)

2.4.3 Topic Modelling

NMF (Non-negative Matix Factorization) is a method of unsupervised learning for grouping a collection of documents and finding out abstract topics. We applied the scikit-learn implementation of NMF with NNDSVD initialization.

Nonnegative Double Singular Value Decomposition (NNDSVD) is typically used for overcoming sparseness of data in document-term matrix [8]. Here we set the number of topics to be equal to 25 and run NMF for 200 iterations, then get the factors W and H from the resulting model:

# 1-star + 2-star reviews
nmfmodel_bad = sk.decomposition.NMF(init="nndsvd", n_components=25, max_iter=200).fit(term_document_matrix_bad)
W_bad = nmfmodel_bad.fit_transform(term_document_matrix_bad);
H_bad = nmfmodel_bad.components_
# W (number of reviews,number of topics) and H (number of topics, number of features)
print "Generated factor W of size %s and factor H of size %s for bad reviews" \
% ( str(W_bad.shape), str(H_bad.shape) )

# 4-star + 5-star reviews
nmfmodel_good = sk.decomposition.NMF(init="nndsvd", n_components=25, max_iter=200).fit(term_document_matrix_good)
W_good = nmfmodel_good.fit_transform(term_document_matrix_good);
H_good = nmfmodel_good.components_
# W (number of reviews,number of topics) and H (number of topics, number of features)
print "Generated factor W of size %s and factor H of size %s for good reviews" \
% ( str(W_good.shape), str(H_good.shape) )

2.5 Results

2.5.1 Top Terms by each topic

#Write functions to plot graphs which present the top topics discovered by the NMF model and create a list of features generated from the NMF model.

# Create a colour series for graph plotting
def grey_color_func(word, font_size, position, orientation, random_state=None, **kwargs):
    return "hsl(0, 0%%, %d%%)" % random.randint(60, 100)

# Writting functions to show top words in each topic
def TopTermsByTopic(nmfmodel, features, top):
    for index, topic in enumerate(nmfmodel.components_):
        print "\n Topic {}: \n".format(index+1)
        print "Percentage of Words: {:.2%}\n".format(np.count_nonzero(topic) / 41961.)
        
        top_words = [features[i] for i in nmfmodel.components_[index].argsort()[::-1][:top]]
        topic_words = ' '.join(top_words)

        #Prepare data for horizontal bar charts
        top15_index = nmfmodel.components_[index].argsort()[::-1][:15]
        top15_topic = sorted(topic[top15_index],reverse=False)

        #Prepare data for wordclouds
        wc = WordCloud(max_font_size=80,relative_scaling=.5,width=800,height=500).generate(topic_words)
       
        #Create a space for graphs
        fig, ax = plt.subplots(2,figsize=(12,10))
        rect1 = ax[0].barh(.5 + np.arange(15) + .5, top15_topic, color="black", align="center")
        rect2 = ax[1].imshow(wc.recolor(color_func=grey_color_func, random_state=3))
        
        #Subplot 1 - Horizontal Bar Chart
        ax[0].set_title("Top 15 Terms in Topic {}".format(index + 1))
        ax[0].set_xlabel("Weight")
        ax[0].set_yticks(.5 + np.arange(15)+ .5)
        ax[0].set_yticklabels([features[i] for i in topic.argsort()[::-1][:15]])
        ax[0].grid(True)
        
        #Subplot 2 - WordCloud
        ax[1].axis("off")
        ax[1].set_title("Wordcloud of Topic {}".format(index + 1))
        
        #Show the graphs
        plt.tight_layout()
        plt.show()
        
        
#Extracting the feature names
features_bad= tfidfvectorizer_bad.get_feature_names()
features_good= tfidfvectorizer_good.get_feature_names()

Horizontal barchart of the top 15 highest weighted terms for the most significant topic - in bad and good reviews - discovered by the NMF model. The rest of the terms are shown in a word cloud.

2.5.1.1 Bad Reviews: 1-star and 2-star reviews

TopTermsByTopic(nmfmodel_bad, features_bad, 100)

Example of Topic 1 for bad reviews:

2.5.1.2 Good Reviews: 4-star and 5-star reviews

TopTermsByTopic(nmfmodel_good, features_good, 100)

Example of Topic 1 for good reviews:

2.5.2 Summary of topics

A summary of the 25 topics from the bad and good reviews - NMF model. An attempt to interpret the contents of each topic was made, keywords in each topic were manually examined and a description of each topic was then assigned. Experiments suggest 25 topics is optimal, allowing for a clear separation of topics.

2.5.2.1 Bad Reviews: 1-star and 2-star reviews

data_bad = {'Index of Topics':['Topic 1','Topic 2','Topic 3','Topic 4','Topic 5','Topic 6','Topic 7','Topic 8'
                           ,'Topic 9','Topic 10','Topic 11','Topic 12','Topic 13','Topic 14','Topic 15'
                           ,'Topic 16','Topic 17','Topic 18','Topic 19','Topic 20',
                            'Topic 21','Topic 22','Topic 23','Topic 24','Topic 25']
        ,'Type of Topics':['Bad service from manager and waiter/waitress ', 'Bad Coffee Shop',
                           'Bad Chicken Dishes (Fried, Boiled, Curry)',
                           'Bad Italian Foods (Pizza and Pasta: Toppings, base, sauce, etc.)',
                           'Bad Fried Chips (Soggy Batter)','Bad Burger', 'Unknown Topic',
                           'Bad wait and time management','Bad Experience and Services',
                           'Bad Restaurant (Birthplace of Harry Potter)','Bad Afternoon Tea',
                           'Unfriendly and Rude Staffs','Bad Chinese Sweet and Sour Foods',
                           'Unknown Topic','Bad Mexican Foods','Bad Place for drinks (too quiet)',
                           'Bad Japanese Foods (Tuna, Miso Soup)','Bad Prices','Bad Breakfast (Eggs Benedict)',
                           'Bad Wait and Time Management','Bad Noodles', 'Good Comments in Bad Reviews',
                           'Expensive Place for Tourists', 'Hot Temperature','Bad Thai Foods'                           
                           ]
       } 

topic_table_bad = pd.DataFrame(data_bad)
display(topic_table_bad)

2.5.2.2 Good Reviews: 4-star and 5-star reviews

data_good = {'Index of Topics':['Topic 1','Topic 2','Topic 3','Topic 4','Topic 5','Topic 6','Topic 7','Topic 8'
                           ,'Topic 9','Topic 10','Topic 11','Topic 12','Topic 13','Topic 14','Topic 15'
                           ,'Topic 16','Topic 17','Topic 18','Topic 19','Topic 20',
                            'Topic 21','Topic 22','Topic 23','Topic 24','Topic 25']
        ,'Type of Topics':['Unknown Topic','Good Quality foods','Good Place and Atmosphere',
                           'Good Bars and Pubs', 'Good Scottish Breakfast',
                           'Good Fish and Chips with nice peas','Good Thai Foods with decent prawn',
                           'Good Indian Foods', 'Good Menu','Good Beef Burger with decent sweet potatoes',
                           'Good Price and Value','Good Sandwiches', 'Good Italian Foods (Pasta and Pizza)',
                           'Good chocolates and ice creams','Good Coffee Shops with Nice Artisan and Expresso',
                           'Good Japanese foods (Bento,Nigiri,Kanpai)','Good Mexican Burritos and Tacos',
                           'Excellent Services', 'Good Afternoon Tea', 'Good BBQ Shops (crackling pork, haggis)',
                           'Unknown German Reviews', 'Friendly Staff','Good Foods', 'Good Vegetarian Restaurants',
                           'Good Potato Shops'
                           ]
       } 

topic_table_good = pd.DataFrame(data_good)
display(topic_table_good)

2.5.3 Comments

Opportunities to explore when opening a restaurant in Edinburgh:

  • Customers are not satisfied with the type of batter for fish and chips in Edinburgh, which was complained to be too soggy. (Bad Review Topic 5)
  • Customers are also not particularly happy with chicken dishes in Edinburgh. (Bad Review Topic 3)
  • Foods and drinks at a cafe where JK Rowling visited were not satisfying and received a lot of complaints. (Bad Review Topic 10)
  • Sweet and sour chickens are not satisfying at chinese restaurants in Edinburgh. (Bad Review Topic 13)
  • Thai Foods for lunch are not good (Bad Review Topic 25)
  • Current noodle shops in Edinburgh are bad (Bad Review Topic 21)
  • Good atmosphere, location, friendly staff and price-to-value could enhance customer experience (Good Review Topics 2,3,11 and 22)

Areas to avoid and improve when opening a restaurant in Edinburgh:

  • Customers complained about bad services from rude and unfriendly waiter and waitress (Bad Review Topics 1 and 12)
  • Customers complained about long wait (Bad Review Topic 8)
  • Customers complained about room temperature (Bad Review Topic 24)
  • Customers complained about bars and pubs that are too quiet (Bad Review Topic 16)

For sectors where businesses are doing very well would also mean new entrants could face a big challenge to enter these sectors.

  • Good Review Topic 4 appeared in 16.57% of reviews, that implies bars and pubs are providing very good services in Edinburgh.
  • Good Review Topic 15 appeared in 13.32% of reviews, that means current coffee shops are providing very good coffees and services.
  • BBQ shops are doing very well in Edinburgh. (Good Review Topic 20)
  • Potato shops are doing very well in Edinburgh. (Good Review Topic 25)

In conclusion, insights above that were extracted by topic modelling are very useful for a company that plans to open a new restaurant in Edinburgh, for example, it was clear that customers are not satisfied with the current fish and chips and chicken dishes in Edinburgh, a new restaurant could potentially target these specific sectors to fullfill the customer demand. Also, company should be careful with certain issues that were discovered from bad reviews, for example, topics such as unfriendly staffs, long wait and uncomfortable room temperature that frequently appeared in bad reviews could negatively affect customer experience. Finally, sectors that are doing very well in Edinburgh such as bars, pubs, BBQ and Potato shops are the areas that a new restraunt may not want to consider in order to avoid fierce competition.

3 Location and Competition - Analysis

This part of the analysis estimates the demand for restaurants in a zip code for Edinburgh and finds out which type of cuisine people are looking for.

The following steps were performed: 1. Data cleaning to extract zip codes and restaurant types; 2. Integrated external demographic data; population, income, and hotels data with the Yelp data. 3. Calculate metrics for each area in order to address the main objective.

3.0.1 Preparing the Data for Competition Analysis.

We used the pandas Python package to filter the data to just that relevant to Edinburgh. Noting that the data contained many business types, we made the assumption that the business was a restaurant if it’s category field stated resturant. We then used this filtered data to extract the relevant tables for check-ins, reviews, tips and users. This provided us with data on 1215 restaurants in Edinburgh that were registered on Yelp.

Regarding restaurant cuisine type, we noticed that many different attributes were used to explain the restaurant category. For example: “Gastropubs,Bars,Scottish,Nightlife,Restaurants”. As we were primarily interested in cuisine, we made a list of cuisine keywords such as “Italian” or “Scottish”. We searched the category for such keywords in order to assign each restaurant to a cuisine type. If there more than one cuisine was detected, we will focused on the cuisine that was the most expressive of the type of food.

In order to combine the external demographic data, we extracted the zip code from each restaurant’s address. In doing so we only focused on the general area, the first three or four letters of the zip code. For example EH6 or EH12.

The data obtained from the Edinburgh government website [4] provided locality and ward demographic profiles. It contained data on; gender, age, housing, employment, education and professions, income, benefits, health and disability, lifestyle, satisfaction with services, and Scottish Index of Multiple Deprivation data - of which population, income and property prices were of primary interest. However, this data was organized by ward, with each ward containing many zip codes. Figure 1 maps different polling districts and figure 2 maps wards districts.

Figure 1: Polling Districts

“Figure 1: Polling Districts”

Figure 2: Wards

“Figure 2: Wards”

To attribute ward demographic data to each zip we used an external data set of polling districts from the City of Edinburgh Mapping portal [5]. Each polling location has a zip code and a ward. Where a zip code makes up the majority of occurances in the ward, the demographic data is all attributed to the zip. For example if the ward “Almond” has 4 zip codes in EH4 and 1 in EH2, EH4 will be considered as Almond ward, and takes the Almond demographic data.

As some zip codes such as EH1 and EH2 are really small, some zip codes were not assigned to any single ward using the method above. For these zip codes, population numbers were attributed in proportion to how many polling stations zip’s are in the ward. For example the central city ward in general is made up of EH1 and EH2. So the population for central city ward is assigned evenly into the populations of EH1 and EH2. Income and property prices remained the same.

Data on all the hotels in Edinburgh was also downloaded from booking.com and integrated. This data was cleansed, to get hotel counts by zip code.

import json
import pandas as pd
from glob import glob

def convert(x):
    ''' Convert a json string to a flat python dictionary
    which can be passed into Pandas. '''
    ob = json.loads(x)
    for k, v in ob.items():
        if isinstance(v, list):
            ob[k] = ','.join(v)
        elif isinstance(v, dict):
            for kk, vv in v.items():
                ob['%s_%s' % (k, kk)] = vv
            del ob[k]
    return ob


def convert2(x):
    ''' Convert a json string to a flat python dictionary
    which can be passed into Pandas. '''
    ob = json.loads(x)
    for k, v in ob.items():
        if isinstance(v, list):
            ob[k] = ','.join(str(v))
        elif isinstance(v, dict):
            for kk, vv in v.items():
                ob['%s_%s' % (k, kk)] = vv
            del ob[k]
    return ob

business = pd.DataFrame([convert(line) for line in file("yelp_academic_dataset_business.json")])
checkin = pd.DataFrame([convert(line) for line in file("yelp_academic_dataset_checkin.json")])
review = pd.DataFrame([convert(line) for line in file("yelp_academic_dataset_review.json")])
tip = pd.DataFrame([convert(line) for line in file("yelp_academic_dataset_tip.json")])
user = pd.DataFrame([convert2(line) for line in file("yelp_academic_dataset_user.json")])

cusine = ['Scottish', 'Mediterranean','Chinese','French','Italian', 'Thai',  'Indian', 'British', 'European', 'Greek', 'Nepalese', 'German', 'Turkish', 'Mexican', 'Pakistan', 'Seafood', 'Japanese', 'Vegetarian', 'Brazilian', 'American','Spanish','Bakeries', 'Gastropubs', 'Soup','Caterers','Cafes','Smoothies','Bistros','Pubs', 'Coffee & Tea','Burgers', 'Delis','Fast Food', 'Sandwiches', 'Pizza', 'Fish & Chips', 'Polish', 'African', 'Korean', 'Middle Eastern', 'Creperies', 'Brasseries','Chicken Wings'  ]

def cusineCategory(x):
    for cus in cusine:
        category = x.categories.encode('ascii','ignore')
        if cus in category:
            return cus

def getZip(x, name):
    start= x[name].find("EH")
    Zip = x[name][start:start + 4].replace(" ", "")
    return Zip
   

edinburgh = business[business.city == "Edinburgh"]
edinburgh = edinburgh[edinburgh['categories'].str.contains("Restaurants")]
edinburgh['cusine'] = edinburgh.apply(cusineCategory, axis = 1)
edinburgh['zip'] = edinburgh.apply(getZip,args=('full_address',),axis=1)
edinburgh = edinburgh[edinburgh.zip != "EH17"]
edinburghid = pd.DataFrame(edinburgh.business_id)
edinReview = pd.merge(edinburghid,review, how='left' ,on="business_id")
edinCheckin = pd.merge(edinburghid,checkin, how='left' ,on="business_id")
edinTip = pd.merge(edinburghid,tip, how='left' ,on="business_id")

edinUserIds = pd.concat([edinReview.user_id,edinTip.user_id])
uniqueIds = pd.DataFrame(edinUserIds.unique())
uniqueIds.columns = ['user_id']

edinUsers = pd.merge(uniqueIds,user, how = 'left' , on = 'user_id')

edinburgh.to_csv('edinburgh.csv', encoding = 'utf-8')
edinReview.to_csv('edinReview.csv', encoding = 'utf-8')
edinCheckin.to_csv('edinCheckin.csv', encoding = 'utf-8')
edinTip.to_csv('edinTip.csv',encoding = 'utf-8')
edinUsers.to_csv('edinUser.csv',encoding='utf-8')


polling = pd.read_csv('polling.csv')
polling['zip'] = polling.apply(getZip,args=('POSTCODE',),axis=1)
polling = polling[['NEWWARD','zip']]
pollGroup = polling.groupby(['NEWWARD','zip']).size().sort_values(ascending=False)
pollGroup

wardsList = ['Almond' , 'Drum Brae Gyle' ,'Pentland Hills', 'Forth' , 'Forth.1', 'Inverleith' , 'Corstorphine Murrayfield' , 'Sighthill Gorgie' , 'Colinton Fairmilehead' , 'Fountainbridge Craiglockhart' , 'Meadows Morningside' , 'City Centre' ,'City Centre.1', 'Leith Walk' , 'Leith ' , 'Craigentinny Duddingston' , 'Southside Newington' ,'Southside Newington.1','Liberton Gilmerton' , 'Portobello Craigmillar']
wards = pd.read_csv('rawwards.csv')
wards['Forth.1'] = wards['Forth']
wards['Forth.1'][0] = "EH5"
wards['Forth.1'][1:17]= pd.to_numeric(wards['Forth'][1:17], errors='coerce')/2
wards['Forth'][1:17] = wards['Forth.1'][1:17]
wards['Southside Newington.1'] = wards['Southside Newington']
wards['Southside Newington.1'][0] = "EH9"
wards['Southside Newington.1'][1:17]= pd.to_numeric(wards['Southside Newington'][1:17], errors='coerce')/2
wards['Southside Newington'][1:17] = wards['Southside Newington'][1:17]
wards['City Centre.1'] = wards['City Centre']
wards['City Centre.1'][0] = "EH2"
wards['City Centre.1'][1:17]= pd.to_numeric(wards['City Centre'][1:17], errors='coerce')/2
wards['City Centre'][1:17] = wards['City Centre.1'][1:17]
wardsMelt = pd.melt(wards, id_vars=['Indicator'], value_vars=wardsList)


zipCode= ""
wardName = ""
wardsMelt['Zip'] = ""
for i, row in wardsMelt.iterrows():
    if row['Indicator'] == 'Zip':
        zipCode = row['value']
    wardsMelt['Zip'][i] = zipCode
    
wardsMelt = wardsMelt[wardsMelt.Indicator != 'Zip']
wardsMelt.columns = ['indicator','type','value','Zip']
wardsMelt[wardsMelt['type'] == 'City Centre']
wardsMelt.to_csv('edinPopulation.csv')

3.1 Analysis

We used Tableau to undertake an initial exploratory data analysis. Figure 3 below shows the number of stores in Edinburgh. Figure 4 shows the type of cuisine that has the most stores. We noticed that in general, Italian and British cuisine are the most popular. The city centre (zip codes EH1 and EH2) contained the greatest number of restaurants.

Figure 3: Density Map of Restaurants/ Figure 4 Number of Restaurants by Cuisine

“Figure 3: Density Map of Restaurants/ Figure 4 Number of Restaurants by Cuisine”

We also looked at the relationship between cuisine type and the number of reviewed. We decided to use the amount of reviews for a restaurant as an indicator of the number of visitors. We decided against using the number of users as an indicator because a user can visit a restaurant twice and give two reviews which will be counted as two visits. This information was then used as an estimate of demand within each zip code.

Figure 5: Cuisines with Reviews

“Figure 5: Cuisines with Reviews”

Figure 5 shows that British food has a very high number of reviews in Edinburgh, combined with the knowledge that there are more Italian than British restaurant in general, it might show that British restaurants are quite popular.

In certain places, such as EH2 which is shown below in Figure 6 shows that there are more Italian than British restaurants and so we can capitalize on the lack of British food to open a British restaurant there.

Figure 6 EH2 Number of Stores and Amount of Cuisines in Edinburgh

“Figure 6 EH2 Number of Stores and Amount of Cuisines in Edinburgh”

Interestingly, the restaurants in the city centre (EH1 and EH2) had the lowest rating average rating (Figure 7). An explanation for this is that there are many restaurants in the city centre and hence many low scores pull down the overall average. From this we can potentially infer that if our restaurant wants to focus on ratings, we should not try to put our restaurant in the city centre.

Figure 7: Average rating per zip code

“Figure 7: Average rating per zip code”

We decided to calculate the demand using the number of reviews. We understand the number of reviews may not be the best indicator of how many customers a restaurant had, but it served as a good proxy in the absence of more concrete data. Figure 8 shows that the locations with the most reviews are EH1, EH2, EH3 and EH8. We used this information in order to focus our analysis on these areas.

Figure 8: Number Reviews per Restaurant and zip

“Figure 8: Number Reviews per Restaurant and zip”

stats = ['All','Average annual household income','Average property value']

population = wardsMelt[wardsMelt.indicator == stats[0]]
population.value = pd.to_numeric(population.value)
populationGroup = population.groupby('Zip').value.sum()

income = wardsMelt[wardsMelt.indicator == stats[1]]
income.value = pd.to_numeric(income.value)
incomeGroup = income.groupby('Zip').value.sum()

propertyValue = wardsMelt[wardsMelt.indicator == stats[2]]
propertyValue.value = pd.to_numeric(propertyValue.value)
propertyGroup = propertyValue.groupby('Zip').value.sum()

restaurantGroup = edinburgh.groupby('zip').cusine.count()

    
hotels = pd.read_csv('edinHotels.csv')
hotels['zip'] = hotels.apply(getZip,args=('zip',),axis=1)

hotelsGroup = hotels.groupby('zip').address.count()

statistics = pd.concat([populationGroup/restaurantGroup,incomeGroup/restaurantGroup,propertyGroup/restaurantGroup, hotelsGroup/restaurantGroup,restaurantGroup],1)
statistics.columns = ['pop/rest','income/rest','property/rest','hotels/rest','Num of Rest']
statistics = statistics.dropna()

We generated metrics for each zip code (table 1). We considered the ratios of: total population per zip to number of restaurants per zip; average income per zip to the number of restaurants per zip; the total number of properties to the number of restaurants per zip; and the total number of hotels to the number of restaurants per zip.

Table 1: Metrics on zip code

“Table 1: Metrics on zip code”

The results of our exploratory analysis led us to choose to build a restaurant at a central location (EH1, EH2, EH3 and EH8). We refer to these areas as our “candidate areas”. EH8 has the largest ratio of population to restaurants compared to the other candidate areas. However, it also has the lowest number of hotels per restaurant. This may mean that there are fewer tourists in the area (who we may wish to attend our restaurant). In contrast, EH3 has attractive population to restaurant and hotel to restaurant ratios. Furthermore, its property value to restaurant ratio is low potentially making it not too expensive to buy a restaurant there.

3.2 Recommendation

Figure 9: Density Map of Restaurant and Number of Restaurant by Cuising for zip EH3

“Figure 9: Density Map of Restaurant and Number of Restaurant by Cuising for zip EH3”

As mention in the analysis, we decided that we will focus on the zip code of EH3. As shown above in Figure 9, British cuisine is not too saturated as there are actually more Chinese restaurants than British restaurants and also a lot of Indian Italian and Thai restaurants. Further using Figure 5, British restaurants seems to be the most popular and hence we should capitalize on the popularity.

Using the topic modeling method, two recommendations we found is customers are not satisfied with the type of batter for fish and chips in Edinburgh, which was complained to be too soggy and customers are also not particularly happy with chicken dishes in Edinburgh. Therefore in EH3, we should focus on being really good at making chicken as well as Fish. We further recommend that “Chicken and Fish” are the only food offered in the restaurant so that we can specialize in making it good.

4 Analysis - Type and Characteristics

In this part, we will classify different kind of restaurants with hierachical clustering technique to find out different types of restaurants. After that we would analyse the distribution of these restaurants and compare it with the distribution of reviews given to these restaurants. By analysing the distribution difference between top reviews and restaurants, we can find out areas where customers do not receive enough satisfying services and locate a strategic location for our restaurant.

4.1 Restaurants Types Analysis

We have total 1215 restaurants in our dataset with 55 of them having no attributes at all. Amount these 1215 restaurants, we clustered them in to 5 different categories to analyse the location and density of different type restaurants in different regions within the city.

After considering different distance metrics and linkage, Euclidean distance metric and ward linkage are chosen for their simplicity and better structured dendrogram.

#Consider a few different linkage 
df3_filtered=df3.loc[(df3!=0).any(1)]
df3_filtered=df3_filtered.iloc[:,0:58]
R1=hclst(df3_filtered.iloc[:,1:58],"ward","euclidean")
df3_filtered
#To determine cuts by different measures
from scipy.cluster.hierarchy import inconsistent
depth = 20
incons1 = inconsistent(R1[0], depth)
incons1[-20:]
plt.figure(figsize=(25, 10))
plt.title('Inconsistency Measure of last 20 Merges with deth ' + str(depth))
plt.xlabel('Merges')
plt.ylabel('Inconsistency')
plt.plot(incons1[-20:,3])
plt.show()
#Determine cuts By Elbow
plt.figure(figsize=(25, 10))
plt.title('Height of the last 20 Merges')
plt.xlabel('Merges from the last merge')
plt.ylabel('Height')
last = R1[0][-20:, 2]
last_rev = last[::-1]
idxs = np.arange(1, len(last) + 1)
plt.plot(idxs, last_rev)
plt.show()

plt.figure(figsize=(25, 10))
plt.title('Rate of increase in Height between Merges')
plt.xlabel('Merges from the last merge')
plt.ylabel('Rate')
acceleration = np.diff(last, 2)  # 2nd derivative of the distances
acceleration_rev = acceleration[::-1]
plt.plot(idxs[:-2] + 1, acceleration_rev)
plt.show()


from scipy.cluster.hierarchy import fcluster
max_d = 15
clusters = fcluster(R1[0], max_d, criterion='distance')
df3_filtered['cluster']=clusters
Fig 1: Dendrogram on style clustering

“Fig 1: Dendrogram on style clustering”

Now to determine cuts by different measures.

#To determine cuts by different measures
from scipy.cluster.hierarchy import inconsistent
depth = 20
incons1 = inconsistent(R1[0], depth)
incons1[-20:]
plt.figure(figsize=(25, 10))
plt.title('Inconsistency Measure of last 20 Merges with deth ' + str(depth))
plt.xlabel('Merges')
plt.ylabel('Inconsistency')
plt.plot(incons1[-20:,3])
plt.show()
#Determine cuts By Elbow
plt.figure(figsize=(25, 10))
plt.title('Height of the last 20 Merges')
plt.xlabel('Merges from the last merge')
plt.ylabel('Height')
last = R1[0][-20:, 2]
last_rev = last[::-1]
idxs = np.arange(1, len(last) + 1)
plt.plot(idxs, last_rev)
plt.show()

plt.figure(figsize=(25, 10))
plt.title('Rate of increase in Height between Merges')
plt.xlabel('Merges from the last merge')
plt.ylabel('Rate')
acceleration = np.diff(last, 2)  # 2nd derivative of the distances
acceleration_rev = acceleration[::-1]
plt.plot(idxs[:-2] + 1, acceleration_rev)
plt.show()


from scipy.cluster.hierarchy import fcluster
max_d = 15
clusters = fcluster(R1[0], max_d, criterion='distance')
df3_filtered['cluster']=clusters
Fig 2: Inconsistency Measure of last 20 merges

“Fig 2: Inconsistency Measure of last 20 merges”

Fig 3: Height of the last 20 merges

“Fig 3: Height of the last 20 merges”

Fig 4: Rate of increase in Height of the last 20 merges

“Fig 4: Rate of increase in Height of the last 20 merges”

From the plots, a cut with 5 clusters gives the most desired result. Now to visualise the results.

from scipy.cluster.hierarchy import fcluster
max_d = 15
clusters = fcluster(R1[0], max_d, criterion='distance')
df3_filtered['cluster']=clusters

#To visualise results
C1=df3_filtered[df3_filtered.cluster==1]
C1=C1.drop(C1.columns[57], axis=1)
C1=C1.describe()
C2=df3_filtered[df3_filtered.cluster==2]
C2=C2.drop(C2.columns[57], axis=1)
C2=C2.describe()
C3=df3_filtered[df3_filtered.cluster==3]
C3=C3.drop(C3.columns[57], axis=1)
C3=C3.describe()
C4=df3_filtered[df3_filtered.cluster==4]
C4=C4.drop(C4.columns[57], axis=1)
C4=C4.describe()
C5=df3_filtered[df3_filtered.cluster==5]
C5=C5.drop(C5.columns[57], axis=1)
C5=C5.describe()

plt.figure(figsize=(25, 10))
plt.title('Cluster 1:Unknown Restaurants')
plt.xlabel('Attributes')
plt.ylabel('Proportion of population')
C1.ix[1,1:56].plot(kind='bar',ylim=[0,1])
df3_filtered[df3_filtered.cluster==1]
Fig 5: Cluster 1: Unknowns

“Fig 5: Cluster 1: Unknowns”

Cluster 1 contains 382 restaurants. Most of these restaurants do not provide much information for classification.

plt.show()
plt.figure(figsize=(25, 10))
plt.title('Cluster 2:Classy Restaurants')
plt.xlabel('Attributes')
plt.ylabel('Proportion of population')
C2.ix[1,0:56].plot(kind='bar',ylim=[0,1])
df3_filtered[df3_filtered.cluster==2]

Fig 6: Cluster 2: The Classies Cluster 2 contains 119 restaurants. They are the most expensive and classy type of restaurants, providing mainly dinner service. Most of them require customers to be dressy. They are more the quiet type of restaurants and some of them provide background music. Those stylish restaurants with classy, romantic, intimate, trendy and upscale atmosphere are mainly in this cluster. They are generally good for groups and dating.

plt.show()
plt.figure(figsize=(25, 10))
plt.title('Cluster 3:Stylish restaurants and Pubs & Bars')
plt.xlabel('Attributes')
plt.ylabel('Proportion of population')
C3.ix[1,0:56].plot(kind='bar',ylim=[0,1])
df3_filtered[df3_filtered.cluster==3]
Fig 7: Cluster 3: The Stylish and Pubs & Bars

“Fig 7: Cluster 3: The Stylish and Pubs & Bars”

Cluster 3 contains 310 restaurants. These are stylish restaurants and pubs & bars priced at level 2, providing mainly dinner and lunch.

They are less quiet, few of them can even be loud or very loud. This clusters contain most of the restaurants that provide dj, jukebox, live and video music and also TV boardcast. Some of these restaurants are quite stylish with lipster, intimate or trendy style. They are mainly good for groups.

plt.show()
plt.figure(figsize=(25, 10))
plt.title('Cluster 4: Fastfood & Takeaways')
plt.xlabel('Attributes')
plt.ylabel('Proportion of population')
C4.ix[1,0:56].plot(kind='bar',ylim=[0,1])
df3_filtered[df3_filtered.cluster==4]
Fig 8: Cluster 4: The Fastfood and Takeaways

“Fig 8: Cluster 4: The Fastfood and Takeaways”

There are 171 restaurants in cluster 4. These are mainly the cheapest kind of restaurants, probably fastfood restaurants as most of them do not provide waiter service, wine nor having any bars. Their embience level is quite average. They are mainly good for kids and also suitable for groups.

plt.show()
plt.figure(figsize=(25, 10))
plt.title('Cluster 5: General Restaurants')
plt.xlabel('Attributes')
plt.ylabel('Proportion of population')
C5.ix[1,0:56].plot(kind='bar',ylim=[0,1])
df3_filtered[df3_filtered.cluster==5]
Fig 9: Cluster 5: The Generals

“Fig 9: Cluster 5: The Generals”

Cluster 5 contains 233 restaurants. They are mainly general restaurants, providing brunch, dinner and lunch. They are less quiet but not the noisiest type. This restaurants are not very stylish and almost none of them provide any music. They are generally group and kids friendly.

4.2 Performance Analysis

Now we can visualise the distribution of different types of restaurants in different zips and analyse the supply of different kind of restaurants.

df3_filtered["zip"]=df2["zip"]
df3_filtered.to_csv("C:/Users/cheukkin.Warwick/Desktop/dm_group/d3_filtered.csv")
from __future__ import division
df3_filtered["Type"]=df3_filtered.cluster
df3_filtered.Type=df3_filtered.Type.map({1:"Unknown", 2:"Classy",3:"stylish and Pubs & Bars",4:"Fastfood & Takeaways",5:"General"})
supply=pd.crosstab(df3_filtered["Type"],df3_filtered["zip"]).apply(lambda r: r*100/r.sum(), axis=1)
supply
Table 1: Restaurants distributions across zips

“Table 1: Restaurants distributions across zips”

Table 1 above shows the percentage of distribution of the same type of restaurants across the zips, each row sums up to 100% of that respective style.

review=pd.read_csv("C:/Users/cheukkin.Warwick/Desktop/dm_group/final_group_project/Kane Wu/edinReview.csv",header=0)
review=review.merge(df3_filtered, on=['business_id'], how='outer')
review_stars=review[((review.stars==4) == True) | ((review.stars==5) == True)]
top_views=pd.crosstab(review_stars["Type"],review_stars["zip"]).apply(lambda r: r*100/r.sum(), axis=1)
top_views

Similarly, we can look at the the distribution of top rating review for each restaurant types in different zips and use it as an indicator of popular location for certain type of restaurants.

Table 2: Reviews distributions across zips

“Table 2: Reviews distributions across zips”

Table 2 above shows the percentage of distribution of reviews for the same type of restaurants across the zips, each row sums up to 100% of that respective style.

From the tables the above, we further calculate the ratio of restaurants receiving top reiviews. For instance, fastfood and takeaways in EH1 receives 39.7% of top reviews in the Fastfood & Takeaways category with only 22.8% of Fastfood & Takeaway restaurants located in the zip. It implies that the quality of Fastfood & Takeaways in EH1 are generally more satisfying than those located in other zip.

To observe the differences, we can look at Table 3 below.

Table 3: Disagreement of percentage distributions of review and restaurant across zips

“Table 3: Disagreement of percentage distributions of review and restaurant across zips”

To observe the performance level, we created a ratio as an indicator of high quality region as an indicator of out-performing region in Table 4 below.

\[\frac{(\text{top review%} - \text{restaurants%})}{\text{%restaurants}}\]

Table 4: Ratio of disagreement of percentage distributions of review and restaurant across zips

“Table 4: Ratio of disagreement of percentage distributions of review and restaurant across zips”

(top_views-supply)/supply

From the calculation above, we can find an out-performing zips for different types of restaurants.

Out-performing regions:

Classy: EH2, EH7, EH8, EH1

Fastfood & Takeaways: EH1, EH7, EH3

General: EH2, EH1, EH3, EH8

Stylish and Pubs & Bars: EH1, EH8, EH2

Under-performing regions:

Classy: EH5, EH10, EH12, EH11

Fastfood & Takeaways: EH14, EH16, EH12, EH14, EH6

General: EH12, EH16, EH15, EH4

Stylish and Pubs & Bars: EH10, EH11, EH5, EH3

4.3 Sectional Recommendations

Now, depends on the estimated quality of our service, we make strategic considerations based on the the result.

4.3.1 Strategy 1 - Competition Avoidance:

We locate our restaurant at a zip with a relatively lower percentage of top reviews than the the percentage of restaurants to avoid competition with top quality restaurants and take up those unsatisfied demand in the zip. However, as the amount of reviews is also a proxy of traffic, we may want to as well avoid under-performing zips as well. Hence, the best location would be zips having the supply demand ratio close to zero while having a certain amount of of restaurants of the same type.

To select the best location, we first locate the zips with performance level clost to zero and then look for the zips with the highest amount of restaurants of the same type. The recommendations are listed below.

Classy: EH6

Fastfood & Takeaways: EH8, EH3

General:EH8, EH9

Stylish and Pubs & Bars: EH2, EH15

4.3.2 Strategy 2 - Business Agglomeration:

Being very confident on the quality of our service, We locate our restaurant at a zip with high out-performing ratio to save our advertising cost and take advantage of the popularity of the location to ensure the source of customers. Here are the recommendation of post codes for different type of restaurants.

To select our best location, we first locate the zips with highest performing ratio and then look for the zips with the highest amount of restaurants of the same type.

Classy: EH2

Fastfood & Takeaways: EH1

General: EH2

Stylish and Pubs & Bars: EH1

4.3.3 Choice and recommendation for our restaurant

4.3.3.1 Choice of strategy

For strategy 2, it involves renting a place in the most popular region. As a result, the cost of investment would be higher than strategy 1. Without further knowledge about the taste and preference of the customers, strategy 2 will impose much more risk than strategy 1. Therefore in the experimental stage, strategy 1 is more preferred for its lower risk exposure.

4.3.3.2 Choice of location

For Fastfood & Takeaways, both EH3 and EH8 have a performance ratio close to zero and a fair amount of restaurants in the region. However EH3 and EH8 are quite different in terms of our two main considerations.First, EH3 has a performance level closer to zero while EH8 has a more negative performance level. On the otherhand, EH8 has much more Fastfood and Takeaways shops then EH3.

4.4 Sessional Conclusion

Consider that the negative performance ratio may indicate a saturated supply and EH3 is therefore decided to be a better location than EH8. SUch decision agrees well with our location decision in our previous session.

5 Analysis - Price, Volume and Revenue

In order to calculate potential revenue, we have estimated the amount of customers a restaurant may receive. In order to do this we first considered the check-in dataset. However, we realised that it did not provide a good estimate of the number of customers as check-ins are based on the amount of offers and so represent a biased value for our analysis. The Yelp site [6] states that “Certain businesses offer discounts when yelpers check in to that business” and further states “You check in with the yelp app on an iPhone/iPad or Android device. You have to be within a close proximity to a location to check-in and the app used your phones GPS to measure your location.” Hence it is likely that most of the customer visits will not have a check-in event (even if the customer came from Yelp) as the customer will most likely not open the application and check-in.

british = edinburgh[edinburgh.cusine == 'British']
britishGroup = british.groupby('zip').cusine.count()
edinReview = pd.merge(reviews,british, how='inner' ,on="business_id")
reviewGroup = edinReview.groupby(['zip']).size()
grouped = pd.concat([britishGroup,reviewGroup, reviewGroup/britishGroup],axis=1)

We therefore decided to use a mutliple of the number of reviews for a british restaurant as an indicator for the number of visitors.(figure 10) We assumed that the number of reviews indicate approximately .1% of total visits. For EH3 there are 161 reviews for British restaurant and 17 British restaurants in total and hence there are 9.47 average number of Reviews per British Restaurant for EH3. We can expect that for a year there will be around 9470 customers per year (9.47 * 1000).

Table2: Average number of Reviews for British Restaurant per Zip

“Table2: Average number of Reviews for British Restaurant per Zip”

For pricing, we decide to take the average price for British Cusine at Eh3. As shown in figure10 below the average price range is around 2.53.

Figure 10: Average Price for British Cusine

“Figure 10: Average Price for British Cusine”

We used data on prices obtained from the Yelp site in order to calculate our expected revenue at a given level of price. The mapping of Yelp price symbols to dollar ($) value is given in table 3.

Table 3: Table 3: Yelp Symbol to Value

“Table 3: Table 3: Yelp Symbol to Value”

2.5 is between the second price range and the third price range so we decided to use the average price of 30 dollars. We calculated that if we open a restaurant with a Yelp price range of 2.5 we can expect to make 9470 * 2.5 which is $284,100 per year.

6 Final Recommendations

7 Reference

[1] D. Cai, X. He, J. Han, and T. S. Huang. Graph regularized nonnegative matrix factorization for data representation. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 33(8):1548–1560, 2011.

[2] J. Choo, C. Lee, C. K. Reddy, and H. Park. UTOPIAN: User-driven topic modeling based on interactive nonnegative matrix factorization. IEEE Transactions on Visualization and Computer Graphics (TVCG), 19(12):1992–2001, 2013

[3] A. Cichocki, R. Zdunek, A. H. Phan, and S. Amari. Nonnegative Matrix and Tensor Factorizations: Applications to Exploratory Multi-Way Data Analysis and Blind Source Separation. Wiley, 2009.

[4] Edinburgh ward and locality demographics. Sourced on June 9th 2016 from http://www.edinburgh.gov.uk/info/20247/edinburgh_by_numbers/1393/locality_and_ward_data_profiles

[5] City of Edinburgh Mapping Portal. Source on June 9th 2016 from http://data.edinburghcouncilmaps.info/datasets/2cee9b18a21344b0879c3c51d71fd2c6_28

[6] Yelp price symbol mapping to real values. Source on June 14th 2016 from http://www.yelp.com/topic/san-diego-can-anyone-give-me-the-actual-dollar-range-for-the-dollar-sign-symbols-in-rrgards-to-pricing